ABSTRACT

De novo motif discovery has been an important challenge of
bioinformatics for the past two decades. Since the emergence of
high-throughput techniques like ChIP-seq, ChIP-exo and protein-binding
microarrays (PBMs), the focus of de novo motif discovery has shifted to
runtime and accuracy on large data sets. For this purpose, specialized
algorithms have been designed for discovering motifs in ChIP-seq or PBM
data. However, none of the existing approaches works perfectly for all
three high-throughput techniques. In this article, we propose Dimont, a
general approach for fast and accurate de novo motif discovery from
high-throughput data. We demonstrate that Dimont yields a higher number
of correct motifs from ChIP-seq data than any of the specialized
approaches and achieves a higher accuracy for predicting PBM intensities
from probe sequence than any of the approaches specifically designed for
that purpose. Dimont also reports the expected motifs for several
ChIP-exo data sets. Investigating differences between in vitro and
in vivo binding, we find that for most transcription factors, the motifs
discovered by Dimont are in good accordance between techniques, but we
also find notable exceptions. We also observe that modeling intra-motif
dependencies may increase accuracy, which indicates that more complex
motif models are a worthwhile field of research.



INTRODUCTION 

New high-throughput techniques such as ChIP-seq (1), ChIP-exo (2) and
protein-binding microarrays (PBMs) (3) have dramatically increased the
amount and quality of data that can be used for de novo motif discovery.
ChIP-seq experiments determine binding regions of DNA-binding proteins
in vivo by cross-linking protein and DNA, immunoprecipitating the
targeted protein and sequencing the bound fragments. In case of
ChIP-exo, the fragments are shortened by an exonuclease before
sequencing. PBMs allow for measuring probe-specific binding affinity
in vitro for a huge number of systematically chosen double-stranded
probes. Despite the experimental differences, these approaches yield
thousands of candidate binding regions together with a measure of
confidence, which can be used for de novo motif discovery.

Ma et al. (4) provide an extensive comparison of de novo motif discovery
tools capable of using ChIP-seq data, where ChIPMunk (5) and POSMO (4)
are the best-performing tools, closely followed by DME (6), DREME (7)
and MEME (8). A detailed comparison of de novo motif discovery tools
using PBM data is given by Weirauch et al. (9), where FeatureREDUCE
emerges as the top-performing algorithm. However, there is no tool that
works well for data from both experimental techniques (9). For ChIP-exo
data, no specialized tool is currently available, and research resorts
to well-established algorithms from the pre-NGS era (2).

The lack of a universally applicable approach hampers 
the integration of data from different techniques and com- 
plicates the comparison of the resulting motifs, e.g. 
between in vivo and in vitro binding. Hence, we propose 
Dimont, a general approach for probabilistic discrimina- 
tive de novo motif discovery that is capable of handling 
ChIP-seq, ChIP-exo and PBM data.

The runtime of most probabilistic de novo motif discovery tools is
mainly determined by iteratively evaluating the likelihood. As the
positions of the binding sites within the target sequences are unknown
(hidden variables), these tools need to consider all admissible binding
site positions for evaluating the likelihood, which has a decisive
influence on runtime. One approach to circumvent this problem is to
resort to k-mer enumeration methods like POSMO (4), which yields a
competitive runtime even on large data sets. Dimont implements an
alternative approach that allows for adhering to probabilistic methods
using the popular 'zero or one occurrence per sequence' (ZOOPS) model of
many de novo motif discovery tools (8,10-13) while achieving acceptable
runtimes. Dimont exploits the fact that only a few binding sites are
buried within long target sequences. In most probabilistic approaches,
this results in a big discrepancy between the number of finally
predicted binding sites and the number of positions that need to be
evaluated for computing the likelihood, and thus in wasting a
considerable amount of runtime during training.

Hence, we only consider those positions contributing the 
most to the likelihood of a target sequence (Figure 1). 
During optimization, we dynamically determine the pos- 
itions to be evaluated keeping the learning scheme flexible 
to adapt to the positions of potential binding sites. This 
acceleration scheme allows for using all ChIP binding 
regions or all PBM probe sequences for de novo motif dis- 
covery instead of limiting the input data to a fixed number 
of (high-confidence) sequences (14). 

As peak occupancies or probe intensities contain 
valuable information for motif discovery, Dimont 
converts these to soft labels reflecting the a priori prob- 
ability of a sequence being bound. These soft labels are 
used for learning parameters by a weighted variant (15) of 
the discriminative maximum supervised posterior prin- 
ciple (16,17). 

In previous studies, the complexity of motif models was 
limited mostly owing to the limited amount of data. For 
this reason, simple models including consensus sequences 
as well as position weight matrices and sequence logos as 
their graphical representation are widespread. However, 
due to the enormous amount of high-throughput data, 
more complex models including inhomogeneous Markov 
models of higher order, which have been proven advanta- 
geous for other binding sites (18,19), can be used for de 
novo motif discovery and prediction of transcription factor 
binding sites. Hence, we include the capability of learning 
higher-order inhomogeneous Markov models into 
Dimont. 

We implement Dimont within the open-source Java 
library Jstacs (20). We provide a Dimont web server at 
http://galaxy.informatik.uni-halle.de and a stand-alone 
command line application at http://www.jstacs.de/index. 
php/Dimont. 



MATERIALS AND METHODS 

The input data of Dimont are DNA sequences $x = x_1, \ldots, x_L$ where
each symbol $x_\ell$ is from the DNA alphabet $\Sigma = \{A, C, G, T\}$.
Each of the sequences is assigned some measure of evidence that reflects
how likely this sequence is bound by the transcription factor of
interest. In case of ChIP-seq and ChIP-exo data, such a measure is the
number of reads or fragments under a ChIP peak, often termed 'peak
statistic' or 'peak occupancy'. For PBM data, such a measure is the
signal intensity of the probe sequence on the microarray.



Figure 1. Normalized likelihood profile of a sequence (plotted against
position). The red dashed line visualizes the threshold that is used to
accelerate the algorithm. All positions with peaks above the threshold
are included in $\mathcal{L}$, and all remaining positions are not used
for evaluating the likelihood.

In the following, we assume that high-confidence se- 
quences, i.e. those with a high peak statistic or a high 
signal intensity, contain a binding site of the motif of 
interest with substantially higher probability than low- 
confidence sequences. Hence, we transform these 
measures to probabilities that reflect how likely a 
sequence is bound by the transcription factor as explained 
in 'Soft labels from peak statistics and signal intensities' 
section. For ChIP-seq and ChIP-exo data, we additionally
assume that binding sites of the targeted transcription 
factor occur clustered around the centers of the ChIP 
peaks. Hence, we use a non-uniform position distribution 
over the binding site positions in the Dimont model, which 
we introduce in 'Dimont models and objective function'. In 
subsequent sections, we describe how we accelerate the 
optimization of the parameters of the Dimont model, we 
outline the complete Dimont algorithm, and we introduce 
the performance measures and data sets used in the case 
studies of this article. 

Soft labels from peak statistics and signal intensities 

We map the peak statistics of ChIP data and the signal intensities of
PBM data to soft labels that reflect the probability assumed a priori of
being bound by the targeted factor. For this reason, we refer to the
probability of being bound as 'foreground probability' and to the
converse probability as 'background probability'.

Here, we propose a mapping that is based on the ranks of the signals
within a data set. We denote as $r_n$ the rank of the $n$-th sequence
$x_n$ in the data set. Let $m = \max_n(r_n)$ be the maximum rank, and
let $h_n = r_n / m$ be the relative rank. We set $q$ to the a priori
fraction of sequences that receives a foreground probability greater
than 0.5, and we refer to $q$ as 'weighting factor'. The value of $q$
can be adapted to the characteristics of the data, for instance, the
significance level of accepted ChIP-seq peaks. In general, it is
reasonable for any data source to also include low-confidence sequences
into the input data to preserve the discriminative nature of Dimont. In
our studies, we use $q = 0.2$ for ChIP data and $q = 0.01$ for PBM data.
We define the foreground probability of sequence $x_n$ as



and the background probability as $w_n^{bg} := 1 - w_n^{fg}$. For
simplicity, we refer to the sequences in conjunction with the foreground
probability as 'foreground' and to the same sequences in conjunction
with the background probability as 'background'.



Dimont models and objective function 

Dimont is based on the popular ZOOPS model used in 
many de novo motif discovery tools (8,10-13). In Dimont, 
the motif model is a uniform mixture model over the DNA 
strands using an inhomogeneous Markov model of user- 
specified order, which includes the position weight matrix 
(PWM) model (21,22) for order 0 and the weight array 
matrix model (18,19) for order 1. We give a detailed def- 
inition of the likelihood of the motif model in Section 1 of 
the Supplementary Material. 

In addition, we use a non-uniform position distribution $P(\ell)$ over
all possible binding site positions $\ell$ relative to an anchor
position. More specifically, we use a Gaussian distribution with given
initial standard deviation of 75 around the anchor position for
ChIP-seq, ChIP-exo and PBM data (details in Section 2 of the
Supplementary Material).

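For positions not covered by a binding site, we use a uniform
distribution. Hence, the likelihood $P_{fg}(x \mid \lambda)$ of an input
sequence $x$, given the ZOOPS model with parameters $\lambda$, is
defined as

$$P_{fg}(x \mid \lambda) = \left(\tfrac{1}{4}\right)^{L} \left(1 - P(\mathrm{motif} \mid \lambda)\right) + \left(\tfrac{1}{4}\right)^{L-w} P(\mathrm{motif} \mid \lambda) \sum_{\ell \in \mathcal{L}} P(\ell)\, P_M(x_\ell, \ldots, x_{\ell+w-1} \mid \lambda), \qquad (2)$$

where $P(\mathrm{motif} \mid \lambda)$ denotes the a priori probability
of observing a motif in a sequence, $\mathcal{L}$ is the set of
admissible positions, initially set to $[1, L-w+1]$, and
$P_M(x_\ell, \ldots, x_{\ell+w-1} \mid \lambda)$ denotes the likelihood
of the motif model of width $w$. During optimization, we adapt
$\mathcal{L}$ according to the acceleration scheme described in the
'Accelerated discriminative learning' section.

As background model $P_{bg}(x \mid \lambda)$, we use either a uniform
distribution or a homogeneous Markov model of order $d$.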

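To optimize the parameters of these models, we introduce a weighted
variant (15) of the discriminative maximum supervised posterior
principle (16,17,23) to de novo motif discovery, i.e.

$$\hat{\lambda} = \underset{\lambda}{\operatorname{argmax}} \left[ \sum_{n=1}^{N} \sum_{c \in \mathcal{C}} w_n^{c} \log \left( \frac{P(c \mid \lambda)\, P_c(x_n \mid \lambda)}{\sum_{\tilde{c} \in \mathcal{C}} P(\tilde{c} \mid \lambda)\, P_{\tilde{c}}(x_n \mid \lambda)} \right) + \log Q(\lambda \mid \alpha) \right], \qquad (3)$$

where $\mathcal{C} = \{fg, bg\}$ is the set of classes, and
$Q(\lambda \mid \alpha)$ denotes the prior on the parameters $\lambda$
given hyper-parameters $\alpha$. In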
case of Dimont, this prior is a transformed product-Dirichlet 
prior (23) using BDeu hyper-parameters (24,25) based on an 
equivalent sample size of 4 for the foreground class and 
4 ■ for the background class. Parameter optimization is 
performed numerically using conjugate gradients. 



Accelerated discriminative learning 

We achieve an acceleration of parameter optimization by two general
ideas. First, we perform a pre-optimization of parameters using a
'reduced data set' containing the highest-confidence sequences of
foreground and background class (hence, these sequences also correspond
to the lowest-confidence sequences of the alternative class). To this
end, we select 30% of the sequences, but not more than 1000 sequences in
total, with the most extreme probabilities $w_n^{fg}$ and $w_n^{bg}$,
respectively. We select these sequences such that the proportion of
foreground and background probabilities is approximately identical to
the full data set by successively adding sequences with the highest
$w_n^{fg}$ and $w_n^{bg}$, respectively.

Second, we observe that only few binding sites are detected within long
target sequences, as depicted exemplarily in Figure 1. A large
proportion of runtime is wasted while evaluating the likelihood of the
motif model for positions that will never be predicted as potential
binding sites. Hence, we only use the most relevant positions
corresponding to the largest summands in Equation (2) instead of
computing all terms. To this end, we compute and normalize all summands
of Equation (2) for each sequence $x$, yielding

$$\gamma_\ell = \frac{P(\ell)\, P_M(x_\ell, \ldots, x_{\ell+w-1} \mid \lambda)}{\sum_{\tilde{\ell}=1}^{L-w+1} P(\tilde{\ell})\, P_M(x_{\tilde{\ell}}, \ldots, x_{\tilde{\ell}+w-1} \mid \lambda)}. \qquad (4)$$

We then rank the positions $\ell$ by $\gamma_\ell$ in descending order.
This rank is different from the rank $r_n$ according to the peak
statistics or signal intensities, respectively, as given by the
biological experiment. Here, the rank reflects the prediction due to the
statistical model. Subsequently, we select in descending order a set of
relevant positions $\mathcal{L}$ until
$\sum_{\ell \in \mathcal{L}} \gamma_\ell \geq 0.5$, and we refer to this
threshold as 'likelihood cutoff'. During numerical optimization, we
determine $\mathcal{L}$ at the beginning of each iteration using the
current set of parameters $\lambda$. Evaluating the likelihood of
Equation (2) in the numerical optimization, we only use the positions in
$\mathcal{L}$.
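
The selection of relevant positions can be illustrated by a minimal Python sketch. It is not the Jstacs implementation; the function name `select_relevant_positions` and the toy input are hypothetical, positions are 0-based, and the input is assumed to hold the unnormalized summands of Equation (2), one value per admissible position.

```python
def select_relevant_positions(summands, cutoff=0.5):
    """Pick the positions with the largest normalized summands.

    summands: unnormalized terms, one per admissible start position
              (assumed here to be P(position) * P_motif(window)).
    cutoff:   likelihood cutoff; positions are added in descending order
              until their normalized values sum to at least this value.
    Returns (sorted selected positions, normalized values).
    """
    total = sum(summands)
    gammas = [s / total for s in summands]  # normalized summands, cf. Equation (4)
    # Rank positions by their normalized summand, largest first.
    ranked = sorted(range(len(gammas)), key=lambda i: gammas[i], reverse=True)
    selected, cumulative = [], 0.0
    for pos in ranked:
        selected.append(pos)
        cumulative += gammas[pos]
        if cumulative >= cutoff:
            break
    return sorted(selected), gammas


if __name__ == "__main__":
    positions, gammas = select_relevant_positions([1.0, 6.0, 1.0, 2.0])
    print(positions, gammas)
```

In this toy profile a single dominant position already carries more than half of the normalized likelihood, so only that position would be evaluated in the next iteration.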

The Dimont algorithm 

In the following, we describe the Dimont algorithm step 
by step. 

Pre-processing 

We read the input sequences including peak statistics or 
probe intensities, which we convert to soft labels. 

Initialization 

For initializing the motif model, we first enumerate all 7mers that
occur in the reduced data set. We then rank these 7mers by
$\log(n_{fg}) \cdot n_{fg} / n_{bg}$, where $n_{fg}$ is the sum of the
foreground probabilities $w_n^{fg}$ of all sequences $x_n$ containing
the current 7mer at least once, and $n_{bg}$ is the corresponding sum of
the background probabilities $w_n^{bg}$. We filter the ranked 7mers by
excluding redundant variants, which have a Hamming distance of <2 to
better-ranked 7mers.

Of the ranked and filtered 7mers, we select the top 50 7mers and use
each of these to initialize the core of a motif model of initial width
$w$ such that the central positions obtain a probability of 0.9 for the
corresponding nucleotide in the 7mer and a probability of 1/30 for each
of the remaining nucleotides. The bordering positions are assigned a
uniform distribution. We then evaluate the conditional likelihood, i.e.
Equation (3) without the prior term, and choose the top m initial motifs
with respect to conditional likelihood.
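
The 7mer ranking can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Dimont code: the function name `rank_kmers` is hypothetical, only the given strand is scanned, the Hamming-distance filter is omitted, and 7mers with a zero foreground or background sum are skipped to keep the score defined.

```python
import math
from collections import defaultdict


def rank_kmers(sequences, fg_probs, k=7):
    """Rank k-mers by log(n_fg) * n_fg / n_bg.

    sequences: list of DNA strings (the reduced data set).
    fg_probs:  foreground probability (soft label) of each sequence;
               the background probability is taken as 1 - fg.
    """
    n_fg = defaultdict(float)
    n_bg = defaultdict(float)
    for seq, w_fg in zip(sequences, fg_probs):
        # Use a set so each k-mer counts once per sequence ("at least once").
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for kmer in kmers:
            n_fg[kmer] += w_fg
            n_bg[kmer] += 1.0 - w_fg
    scores = {
        kmer: math.log(n_fg[kmer]) * n_fg[kmer] / n_bg[kmer]
        for kmer in n_fg
        if n_fg[kmer] > 0.0 and n_bg[kmer] > 0.0
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    seqs = ["ACGTACGA", "ACGTACGT", "TTTTTTTT"]
    print(rank_kmers(seqs, [0.9, 0.8, 0.1]))
```

The 7mer shared by the two high-confidence sequences accumulates the largest foreground sum and is ranked first, which is the behavior the score is meant to reward.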

Pre-optimization 

For each of the m initial motifs, we optimize the param- 
eters according to Equation (3) on the reduced data set 
using the accelerated optimization described in the 
previous section. We then rank the resulting motifs by 
the supervised posterior achieved in the optimization. 

Filtering motifs 

Initialization and pre-optimization may result in redun- 
dant motifs, e.g. shifted variants or reverse complemen- 
tary motifs. To reduce runtime, we filter such redundant 
motifs before the final optimization. We consider two 
motifs redundant if their score profiles, i.e. their $\gamma_\ell$ for
all positions $\ell = 1, \ldots, L-w+1$, show a Pearson correlation
greater than 0.3, averaged over all sequences in the
reduced data set. During this filtering step, we allow for 
shifts of the score profiles up to w in both directions. 
If two motifs are considered redundant, we keep the 
motif variant achieving the larger supervised posterior. 

Final optimization 

For each of those motifs that remain after the filtering 
step, we optimize the parameters with respect to 
Equation (3) on the complete input data set. 
Subsequently, we compute the Kullback-Leibler diver- 
gence (26) between the marginal distribution at each 
motif position and the nucleotide composition of the 
complete data set. We remove bordering motif positions 
as long as Kullback-Leibler divergence is below 0.2. If 
Kullback-Leibler divergence exceeds 0.8 for a bordering 
position, we expand the motif by one additional position. 
We then adjust the standard deviation of the position dis- 
tribution, and finally optimize the parameters with respect 
to Equation (3) on the complete input data set. Again, we 
rank the resulting motifs by the supervised posterior 
achieved in the optimization on the complete data set. 
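
The trimming of bordering motif positions can be sketched as below. The sketch assumes a PWM given as a list of columns over (A, C, G, T) and the natural logarithm for the Kullback-Leibler divergence; the text does not state the base of the logarithm, and the names `kl_divergence` and `trim_borders` are hypothetical. The expansion step for divergences above 0.8 is not shown.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q), natural log (an assumption)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def trim_borders(pwm, background, threshold=0.2):
    """Remove bordering columns whose divergence from the background
    nucleotide composition is below the threshold."""
    start, end = 0, len(pwm)
    while start < end and kl_divergence(pwm[start], background) < threshold:
        start += 1
    while end > start and kl_divergence(pwm[end - 1], background) < threshold:
        end -= 1
    return pwm[start:end]


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    pwm = [uniform, [0.97, 0.01, 0.01, 0.01], uniform]
    print(trim_borders(pwm, uniform))
```

Columns that match the background composition carry no information about the binding site, so removing them shortens the motif without changing its discriminative core.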

Post filtering 

We finish the Dimont algorithm with a final filtering step 
in analogy to pre-optimization to eliminate redundantly 
reported motifs. 

Default parameters 

As default parameters of Dimont, we suggest (i) a motif model of order
0, i.e. a PWM model; (ii) a uniform background model; (iii) a weighting
factor of q = 0.2; (iv) an initial motif width of w = 15; and (v)
m = 20 pre-optimization runs. We use these default parameters
throughout this article if not stated otherwise.

Performance measures 

For evaluating the performance of de novo motif discovery 
predictions, several measures have been used. For PBM data, we
stick to the area under the receiver-operating
characteristic curve (AUC-ROC) and Pearson correlation, 
as these have been used as final performance measures in 
the DREAM5 challenge (9). Pearson correlation is sensi- 
tive to monotone transformations of the predicted scores, 
while AUC-ROC is insensitive to such transformations. 
For maximizing the Pearson correlation, we search an 
adequate transformation, 



where r is the predicted score, namely, the likelihood ratio, 
and c is a free parameter. We optimize c to maximize the
Pearson correlation on the training data and use this 
optimal value to transform the likelihood ratios of the 
test data. For computing AUC-ROC, probe sequences 
with a mean signal intensity >4 standard deviations 
above the experiment average are assigned to the 
positive class, and all other probe sequences are assigned 
to the negative class (9). 
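
The class assignment used for AUC-ROC can be written as a short Python sketch. The function name `label_probes` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not specify which estimator is used.

```python
import statistics


def label_probes(mean_intensities, n_sd=4.0):
    """Assign probes to the positive class (True) if their mean signal
    intensity exceeds the experiment average by more than n_sd standard
    deviations; all other probes are negative (False)."""
    average = statistics.fmean(mean_intensities)
    sd = statistics.pstdev(mean_intensities)  # population SD (assumption)
    threshold = average + n_sd * sd
    return [value > threshold for value in mean_intensities]


if __name__ == "__main__":
    intensities = [1.0] * 99 + [100.0]
    labels = label_probes(intensities)
    print(sum(labels), "positive probe(s)")
```

With this rule the positive class is typically very small, which is consistent with the low weighting factor chosen for PBM data.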

Comparing the results of Dimont to other tools and 
between experiments, we use sequence logos as proposed 
by Ma et al. (4), the normalized Euclidean distance as 
proposed by Linhart et al. (27) and AUC-ROC. 

Data 

ChIP-seq data

We obtain the ChIP-seq peaks (centers and peak statistics) of the 26
ChIP-seq data sets compiled by Ma et al. (4) from original publications
(28-34). For the comparison of ChIP-seq and PBM data, we additionally
obtain the ChIP-seq peaks of Foxo1 (GSM546525, (35)), GATA4 (GSM558904,
(36)), Tcf3 (GSM915177, unpublished), Tbx5 (GSM558908, (36)) and Tbx20
(GSM734426, (36)) from Gene Expression Omnibus
(http://www.ncbi.nlm.nih.gov/geo/), and of Nr5a2 (SRP001796, (37)) from
the hmChIP database ((38),
http://jilab.biostat.jhsph.edu/database/cgi-bin/hmChIP.pl).

For each of these data sets, we download the genome sequences of the
corresponding species and genome version (hg18, mm8, mm9, dm3) from the
UCSC Genome Browser (http://hgdownload.cse.ucsc.edu/downloads.html). For
each ChIP-seq peak, we extract 1000 bp of genomic sequence centered
around the given peak summit and annotate these sequences with the
corresponding peak statistic.

ChIP-exo data

We obtain the ChIP-exo peaks (peak coordinate and occupancy) from the
supplement of Rhee and Pugh (2). For CTCF, we download the human genome
sequences (hg18) from the UCSC Genome Browser. In case of the three
yeast data sets, we obtain the yeast genome (build 19-Jan-2007) from the
Saccharomyces Genome Database
(http://www.yeastgenome.org/download-data). For each ChIP-exo peak, we
extract, based on CW distance, 200 bp (CTCF) or 100 bp (yeast factors)
of genomic sequence centered around the given peak center, and we
annotate these sequences with the corresponding peak occupancy.

PBM data

We obtain the 40 tuning, the 66 training and the 66 test PBM data sets
of DREAM5 challenge 2
(http://wiki.c2b2.columbia.edu/dream/index.php/D5c2). For the comparison
of ChIP-exo and PBM data, we additionally obtain PBM data sets for Phd1
(UP00351) and Rap1 (UP00321) from the UniPROBE database (39,40)
(http://thebrain.bwh.harvard.edu/uniprobe/). Of each probe sequence, we
extract the first 40 bp, comprising 35 unique base pairs and 5 bp of
linker sequence.

In case of the PBM data sets of DREAM5, we follow 
the proposal of Weirauch et al. (9) and use the mean signal 
intensities after spatial detrending and quantile 
normalization. 

RESULTS 

Runtime 

We assess the runtime of Dimont on all data sets considered in this
article on a standard laptop (Intel Core i7, ULV, dual core, 2 GHz)
using standard parameters. In Figure 2, we plot the runtime of Dimont
against the size of the input data set for different types of input
data. For ChIP-seq data sets comprising sequences of length 1000 bp, we
observe runtimes of ~5 min for medium-sized data sets. On the largest
ChIP-seq data set containing 73 795 sequences of length 1000 bp, Dimont
runs for 1 h 15 min. Without the speed-up strategy described in the
'Accelerated discriminative learning' section, runtime would increase by
a factor of 5 to 29, as shown for selected data sets, namely, KNI (504
sequences), c-Myc (3 413 sequences), KR2 (5 793 sequences) and FoxA2
(11 461 sequences). We give a detailed overview of runtime dependency on
the speed-up strategy and motif order in Supplementary Figures S1-S3.

For ChIP-exo data sets comprising sequences of length
200 in case of CTCF and sequences of length 100 in case 
of the yeast data sets, runtime decreases substantially, and 
Dimont reports a motif after at most 5 min. 

In case of the PBM data containing ~40 000 probe se- 
quences of length 40 bp per data set, Dimont runs for 
2-8 min. 

ChIP-seq

In a first case study, we assess Dimont using default parameters
on the 26 ChIP-seq data sets of Ma et al. (4). In
Figure 3, we present exemplary motifs for three of the 
factors considered, while the motifs reported for all data 
sets are available in Supplementary Figure S4. In addition 
to a visual comparison of the motifs discovered to those 
from the literature, we consider the normalized Euclidean 
distance (27) between the two motifs as a measure for their 
similarity. 

The motif of FoxA2 discovered by Dimont closely resembles the motif
reported in the Jaspar database (41) with clear consensus GTAAACA
(normalized Euclidean distance d = 0.06). The motif of Tcfcp2l1 is also
recovered well by Dimont (d = 0.12), although minor differences are
visible: the strength of conservation at some positions differs between
the motif reported by Dimont and that of Jaspar. In addition, Dimont
includes two additional positions with a slight preference for A into
the motif, while the last conserved G, present in the Jaspar motif, is
omitted. The latter might be an effect of the strand model of Dimont
combined with the roughly palindromic structure of Tcfcp2l1.

Figure 2. Runtime evaluation of Dimont on the data sets used in this
article. We consider all ChIP-seq (blue), ChIP-exo (red) and PBM (green)
data sets used in this article. Upright triangles represent the runtime
without the speed-up strategy, whereas reversed triangles represent the
runtime using the speed-up strategy. Runtime decreases by a factor of 5
to 29 due to the speed-up strategy.

The motif of KNI (d = 0.20) is one of three motifs that
are discovered from ChIP-seq data exclusively by Dimont
(cf. Supplementary Figure S4, Table 1). We find that the
consensus of the Jaspar motif (AAANTAGAGCA) fits 
the motif discovered by Dimont. However, we find two 
notable differences between the two motifs. First, the 
sequence of As at the 5' end of the motif is more conserved 
in the Jaspar motif. Second, we find mildly conserved Gs 
at positions 4 and 12 of the motif reported by Dimont, 
which are not present in the Jaspar motif. 

We assess the performance of Dimont on all data sets of Ma et al. (4) by
counting the number of data sets for which Dimont successfully discovers
the known motif for the targeted transcription factor. We define a
discovery as successful if and only if the normalized Euclidean distance
between the predicted motif and the motif described in the literature
(4,33,34,41,42) is smaller than 0.25. We give an overview of this
assessment in Table 1, and we additionally include the number of motifs
correctly discovered by POSMO (4), MEME (8), DME (6), ChIPMunk (5), HMS
(42) and DREME (7) as reported by Ma et al. (4). All motifs discovered
by Dimont are presented in Supplementary Figure S4.

We find by comparing the discovered motifs to the lit- 
erature using the normalized Euclidean distance that 
Dimont discovers all 26 motifs. As reported by Ma et al. 
(4), POSMO and ChIPMunk discover 23 motifs; MEME,
DME and DREME discover 22 motifs; and HMS discovers
12 motifs. Three motifs (CAD, E2f1 and KNI)
are discovered only by Dimont but by none of the 
previous approaches. 

Figure 3. Three exemplary motifs discovered by Dimont on the FoxA2,
Tcfcp2l1 and KNI data sets of Ma et al. (4) compared with the
corresponding motifs from the Jaspar database.

Table 1. Number of motifs successfully discovered by Dimont on the data
sets compiled by Ma et al. (4) compared with the results of POSMO,
ChIPMunk, MEME, DME, DREME and HMS

Algorithm    Total successes    Average rank
Dimont       26                 1.23
POSMO        23                 1.00
ChIPMunk     23                 1.00
MEME         22                 1.32
DME          22                 1.45
DREME        22                 1.45
HMS          12                 1.00

We define a discovery as successful if and only if the normalized
Euclidean distance between the predicted motif and the motif described
in the literature is smaller than 0.25.

Considering the average rank of correct predictions, we find that for 20
of the 26 data sets, Dimont reports the correct motif on rank 1. For the
remaining six data sets (CAD, GT, KNI, KR1, KR2 and Nanog), Dimont
reports the correct motif only on rank 2. Scrutinizing such cases
(Supplementary Figure S4), we find that the first motif reported for
Nanog does not show a clear similarity to other known motifs. For KNI,
Dimont reports the binding motif of CAD on rank 1, which can be
explained by substantial co-binding of KNI and CAD (32). For four
Drosophila melanogaster data sets (CAD, GT, KR1, KR2), the first motif
reported by Dimont is almost identical, having consensus CAGGTAG. The
same motif is also discovered by Dimont as a second motif for the HB1
and BCD data sets. This motif is bound by the Zelda (ZLD) transcription
factor, a member of the so-called TAGteam (43). ZLD has been reported to
play a key role in transcriptional activation during maternal-to-zygotic
transition, and regions bound by ZLD in early development are later
occupied by several specific transcription factors including BCD, CAD,
GT, KR and HB (44).

In summary, Dimont discovers all motifs of the ChIP-seq
data sets compiled by Ma et al. (4), including three
motifs that are not found by previous approaches. For the 
majority of data sets, Dimont returns the correct motif at 
rank 1, whereas rank 2 for the remaining data sets can 
often be explained by biological phenomena. 

ChIP-exo

In a second case study, we investigate the capability of Dimont to
discover motifs in ChIP-exo data. To this end, we consider four of the
five ChIP-exo data sets compiled by Rhee and Pugh (2): human CTCF as
well as Rap1, Reb1 and Phd1 from Saccharomyces cerevisiae. We exclude
the Gal4 data set, as it contains only 15 binding regions.

We present the motifs reported by Dimont using default parameters for
the yeast data sets in Figure 4. The motif discovered for Rap1 closely
resembles the core of the 'telomeric' motif of Rap1 found by Rhee and
Pugh (2) and is an extended variant of the motif reported in Jaspar. In
case of Reb1, the consensus TACCCG of the discovered motif is identical
to the previously reported Reb1 consensus (2,45) and highly similar to
the Jaspar motif. For Phd1, Dimont finds a motif highly similar to the
Phd1 motif discovered by Zhu et al. (39) from PBM data and to the Phd1
motif reported in Jaspar. Notably, this motif has not been discovered
from these ChIP-exo data by Rhee and Pugh (2) using MEME for de novo
motif discovery.

For the human insulator CTCF, ChIP-exo as well as
ChIP-seq data are available. We show a comparison of
the motifs discovered by Dimont from the ChIP-seq
and ChIP-exo data sets to the motif present in Jaspar in
Figure 5. All three motifs are highly similar, whereas the 
level of conservation slightly differs for some positions. 

In summary, Dimont discovers the binding motifs of all 
four transcription factors from the ChIP-exo data sets
considered. 

Protein binding microarrays 

In a third case study, we consider the applicability of Dimont to PBM
data. To this end, we assess the performance of Dimont on the data
provided by DREAM5 challenge 2 (cf. 'PBM data' section). In this
challenge, the signal intensities of one PBM layout should be predicted
based on the probe sequences and the signal intensities of all probes of
another PBM layout. During the challenge, tuning data for both PBM
layouts were provided for calibrating external parameters of the
participating approaches. We use these tuning data to determine (i) the
optimal order d of the background model and (ii) the optimal weighting
factor q for PBM data, whereas the initial motif width and the number of
pre-optimization runs are left at their default values (cf. 'The Dimont
algorithm' section). We present the results of these analyses in
Figure 6. Regarding the order of the background model, we find
consistently for motif orders 0 to 2 that prediction performance as
measured by AUC-ROC and Pearson correlation increases up to a background
order of 4. From the first row of Figure 6, we also observe that motif
order 1 (weight array matrix (WAM) model) performs consistently better
than orders 0 and 2.

Figure 5. Motifs discovered by Dimont on the ChIP-seq and ChIP-exo data
sets of the human insulator CTCF compared with the CTCF motif from the
Jaspar database.

Hence, we fix the motif order to 1 in the second row of 
Figure 6 and investigate the influence of the weighting 
factor q on the prediction performance for different background
orders. We find that for higher background
orders, AUC-ROC increases with decreasing weighting 
factor. Considering Pearson correlation, a weighting 
factor of 0.01 performs slightly better than 0.005, 
whereas 0.02 reaches a comparable correlation for most 
background orders. 

Allowing for model selection with regard to motif order, we choose for
each data set the motif order yielding the maximum AUC-ROC on the
training data set and test the prediction performance on the
corresponding test data set. Doing so for the tuning data sets, the
performance slightly increases, yielding an AUC-ROC of 0.958 and a
Pearson correlation of 0.714.

Figure 6. Influence of the choice of background order for different
motif orders and weighting factor on the performance on the tuning data
sets of the DREAM5 challenge. In the first row, we plot performance
against background order for motif orders 0, 1 and 2 and a fixed
weighting factor of 0.01. In the second row, we plot performance against
weighting factors for a uniform background model and background orders 0
to 5, given a fixed motif order of 1.

Given these results on the tuning data, we fix the back- 
ground order to 4 and the weighting factor to 0.01 in the 
following analyses on the DREAM5 training and test 
data. We train Dimont for motif orders 0 to 2 on each 
of the 66 training data sets and allow for selection of motif 
order on the training data. Following the proposal 
of Weirauch et al. (9), we consider the average Pearson 
correlation cc and the average AUC-ROC roc over all 66 
test data sets, and we compute a final score as 
(cc/0.696 + (roc - 0.5)/(0.949 - 0.5))/2. Thereby, 0.696 is
the maximum Pearson correlation, and 0.949 is the
maximum AUC-ROC gained by any of the approaches
considered by Weirauch et al. (9).

Table 2. Performance of Dimont on the DREAM5 data compared
with the best approaches according to Weirauch et al. (9) for
predicting PBM signal intensities from probe sequence as measured by
Pearson correlation, AUC-ROC and a combined final score

Algorithm        Pearson corr.   AUC-ROC   Final
Dimont           0.695           0.951     1.002
FeatureREDUCE    0.693           0.949     0.997
Team_D           0.691           0.938     0.984
Team_E           0.696           0.906     0.952

In Table 2, we compare the prediction accuracy 
achieved by Dimont to that of the top performers accord- 
ing to Weirauch et al. (9), namely, FeatureREDUCE, 
Team_D and Team_E. The maximum Pearson correlation 
of 0.696 is gained by Team_E, whereas among the existing 
approaches, the maximum AUC-ROC of 0.949 is gained 
by FeatureREDUCE. We find that Dimont achieves a 
Pearson correlation of 0.695, which is slightly greater 
than the Pearson correlation of the top performer 
FeatureREDUCE but slightly smaller than the Pearson 
correlation gained by Team_E. Considering AUC-ROC, 
Dimont yields a slightly greater AUC-ROC than all of the 
existing approaches considered. However, because of the 
large variation between the different data sets, neither of 
these improvements can be considered significant. 

Combining Pearson correlation and AUC-ROC, Dimont 
yields a greater final score than FeatureREDUCE, 
Team_D, Team_E and all other approaches considered 
by Weirauch et al. (9). 
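
As an illustration, the final score defined above can be recomputed from the averaged Pearson correlation and AUC-ROC. The following is a minimal sketch in Python; the function and constant names are hypothetical and not part of Dimont or Jstacs, and the inputs are the rounded averages reported in Table 2.

```python
# Normalisation constants quoted in the text: the best Pearson correlation and
# the best AUC-ROC among the approaches considered by Weirauch et al. (9).
MAX_PEARSON = 0.696
MAX_AUC_ROC = 0.949


def final_score(cc, roc):
    """Combine average Pearson correlation (cc) and average AUC-ROC (roc)."""
    pearson_term = cc / MAX_PEARSON
    # AUC-ROC is shifted by 0.5 so that random guessing contributes 0.
    auc_term = (roc - 0.5) / (MAX_AUC_ROC - 0.5)
    return (pearson_term + auc_term) / 2.0


if __name__ == "__main__":
    # Rounded averages as reported in Table 2 (hypothetical variable names).
    for name, cc, roc in [("Dimont", 0.695, 0.951), ("Team_E", 0.696, 0.906)]:
        print(name, round(final_score(cc, roc), 3))
```

Because the published averages are rounded to three digits, scores recomputed in this way may differ from the tabulated final scores in the last digit for some approaches.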

As model selection with regard to motif order further 
increases the prediction performance of Dimont, we 
consider the selected model orders for different families 
of transcription factors, and we give a complete list of 
chosen model orders in Supplementary Table S1. For
most families, we do not find a clear preference for a 
specific motif order. Notable exceptions are the 
AT hook family, which appears to profit from second- 
order dependencies, the bHLH and nuclear receptor 
families showing a preference for motif order 1, and the 
C2H2 zinc finger family, which shows a slight shift to 
motif order 0 compared with all transcription factors. 
Motif order 0 is chosen for less than one-third of the 
data sets, whereas higher motif orders are preferred for 
more than two-thirds of the data sets. 

Comparison of de novo motif discovery using different 
experimental techniques 

Owing to the general applicability of Dimont to ChIP-seq,
ChIP-exo and PBM data demonstrated in the previous
sections, we have the opportunity to investigate the
consistency of the discovered motifs between in vitro and in
vivo binding and between different technologies. To this
end, we consider all transcription factors for which both
PBM data and ChIP-seq or ChIP-exo data are available,
and CTCF for a ChIP-seq/ChIP-exo comparison.

In a first study, we run Dimont on the PBM data set
and the corresponding ChIP data set using a PWM model
and the standard parameters for each technology
(ChIP-seq/ChIP-exo: uniform background, q = 0.2, w = 15,
m = 20; PBM: background order d = 4, q = 0.01,
w = 15, m = 20) and compare the resulting binding
motifs. We present the results of this study in Figure 7.
For many data sets, namely, Esrrb, Foxo1, Gata4 and
Zfx, we obtain largely similar motifs for both PBM and
ChIP-seq/ChIP-exo data. This indicates that in vitro
binding assays like PBMs are a valuable technique to
determine binding specificities that are also valid in vivo. For
Nr5a2, Phd1, Rap1 and Tcf3, we find minor differences
between the PBM and the ChIP-seq/exo motif, which are
basically different levels of conservation and differences in
the number of flanking positions. We observe the greatest
differences for the two T-box motifs, namely, Tbx5 and
Tbx20. The PBM motifs of Tbx5 and Tbx20 are similar,
both having consensus TNACACCT, and agree with in
vitro T-box motifs from the literature (46,47). The
ChIP-seq motifs for both factors differ substantially from their
PBM counterparts and from each other. Although the
reason for this observation remains unclear, a similar in
vivo motif of Tbx20 and a similar discrepancy between in
vitro and in vivo binding of Tbx20 have been reported
before (47), which might indicate similar effects for
other T-box factors including Tbx5. An alternative
explanation might be that in vivo Tbx5 co-binds with
another factor enriched in the top ChIP-seq peaks.
However, increasing q up to 0.6 does not result in a
different motif, although a greater number of sequences are
considered to be bound.

In a second study, we consider classification across
technologies as an additional indication of the agreement
of in vitro and in vivo binding. In the case of PBM data, we use
the partitioning into positive and negative probe
sequences proposed in the DREAM5 challenge (9). For
ChIP-seq and ChIP-exo data, the positive class contains
the sequences around the top 500 ChIP peaks, and the
negative class comprises 10 shuffled variants of each
positive sequence preserving di-nucleotide content. We
assess the classification performance across technologies
and for model orders 0 and 1 in a 10-fold cross-validation
(details given in Section 6 of the Supplementary Material).
For the assessment in each iteration of a cross-validation
run, we use only the motif reported by Dimont at rank 1.
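
The text does not state which algorithm is used to generate the shuffled negatives. One common way to shuffle a sequence while exactly preserving its di-nucleotide counts is a random Eulerian walk in the style of Altschul and Erickson; the following Python sketch illustrates that idea under this assumption. All function names are hypothetical and are not taken from Dimont or Jstacs.

```python
import random
from collections import Counter, defaultdict


def _reaches(start, target, final):
    """Check that following the chosen final edges from start arrives at target."""
    seen = set()
    node = start
    while node != target and node not in seen:
        seen.add(node)
        node = final[node]
    return node == target


def dinucleotide_shuffle(seq, rng):
    """Return a random sequence with the same di-nucleotide counts as seq."""
    if len(seq) < 3:
        return seq
    last = seq[-1]
    # successors[a] lists every nucleotide that follows a in seq (with repeats).
    successors = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        successors[a].append(b)
    inner = [v for v in successors if v != last]

    # Pick one "final" outgoing edge per nucleotide (except the last one) such
    # that following the final edges always leads to the last nucleotide.
    while True:
        final = {v: rng.choice(successors[v]) for v in inner}
        if all(_reaches(v, last, final) for v in inner):
            break

    # Shuffle the remaining edges; the final edge is used last for each nucleotide.
    order = {}
    for v, succ in successors.items():
        rest = list(succ)
        if v in final:
            rest.remove(final[v])
        rng.shuffle(rest)
        if v in final:
            rest.append(final[v])
        order[v] = rest

    # Walk through the graph, consuming the edges of each nucleotide in order.
    used = Counter()
    out = [seq[0]]
    for _ in range(len(seq) - 1):
        cur = out[-1]
        out.append(order[cur][used[cur]])
        used[cur] += 1
    return "".join(out)


# Ten shuffled variants of one (made-up) positive sequence.
rng = random.Random(0)
positive = "ACGTTGCAACGGTCAGT"
negatives = [dinucleotide_shuffle(positive, rng) for _ in range(10)]
```

Each shuffled variant keeps the first and last nucleotide and every di-nucleotide count of the input, so differences in classification cannot be attributed to di-nucleotide composition alone.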

We use the Dimont classifiers obtained on the ChIP and
PBM training data to classify both the PBM and ChIP
data sets for the same transcription factor. For PBM data,
we train the classifier using background order 4 as before
but replace the background model by a uniform
distribution for testing to eliminate influences other than the
motif model on classification performance. We present the
results of this cross-validation in Table 3.
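
For readers unfamiliar with the measure, AUC-ROC can be read as the probability that a randomly chosen positive sequence receives a higher classifier score than a randomly chosen negative sequence, with ties counted as one half. The following Python sketch computes it in that way; it is a generic illustration with hypothetical names and made-up scores, not the evaluation code used for Table 3.

```python
def auc_roc(pos_scores, neg_scores):
    """Area under the ROC curve as the fraction of correctly ordered pairs."""
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # ties count as one half
    return correct / (len(pos_scores) * len(neg_scores))


# Made-up classifier scores for three positive and three negative sequences.
example_auc = auc_roc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```

In this toy example, eight of the nine positive/negative pairs are ordered correctly. The pairwise loop is quadratic in the number of sequences; for large data sets a rank-based computation gives the same value more efficiently.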

For Esrrb, Foxo1, Gata4, Nr5a2 and CTCF, the classifiers
applied to data from a different technology than
used for training achieve a performance that is comparable
with the intra-technology case. In case of Tbx20 and
Tbx5, we observe a considerably decreased performance in
at least one direction of the cross-technology comparison,
a result that is consistent with the previous statements on
the motif level. Although the PBM classifiers for Tcf3 and
Zfx show a decreased AUC-ROC for ChIP-seq test data,
the ChIP-seq classifiers for these data sets yield a comparable
performance on the PBM test data as for the ChIP-seq
test data. In both cases, one explanation might be the
low number of conserved motif positions (cf. Figure 7),
which leads to a large number of random hits in the
shuffled negative sequences. For Phd1 and Rap1, the
ChIP-exo classifiers yield lower AUC-ROC values on
the PBM data than the PBM classifiers, whereas the
converse combinations yield a classification that is
comparable with the ChIP-exo classifiers.

Figure 7. Comparison of the motifs discovered by Dimont using PBM and ChIP-seq or ChIP-exo data. For Esrrb, Foxo1, Gata4 and Zfx, we obtain
largely similar motifs for PBM and ChIP-seq/ChIP-exo data, whereas we find minor differences for Nr5a2, Phd1, Rap1 and Tcf3. In case of Tbx5
and Tbx20, the motifs discovered from PBM and ChIP-seq data differ substantially.

Table 3. Mean AUC-ROC of a 10-fold cross-validation

                Test(ChIP-seq)                            Test(PBM)
                Train(ChIP-seq)      Train(PBM)           Train(ChIP-seq)      Train(PBM)
Factor          Order 0   Order 1    Order 0   Order 1    Order 0   Order 1    Order 0   Order 1
Esrrb           0.922     0.930*     0.901*    0.885      0.896     0.908      0.861     0.906*
Foxo1           0.746     0.768*     0.752     0.794*     0.902*    0.868      0.957     0.962
Gata4           0.787     0.807*     0.739     0.777*     0.974     0.974      0.983*    0.979
Nr5a2           0.853     0.858      0.858     0.866*     0.910*    0.864      0.963     0.965
Tbx20           0.772     0.770      0.512     0.524      0.570     0.691*     0.994     0.990
Tbx5            0.629     0.634      0.604*    0.591      0.808*    0.550      0.992     0.993
Tcf3            0.929     0.925      0.784     0.807*     0.973*    0.886      0.973     0.977*
Zfx             0.723     0.719      0.556     0.563      0.950*    0.942      0.970     0.967

                Test(ChIP-seq)                            Test(ChIP-exo)
                Train(ChIP-seq)      Train(ChIP-exo)      Train(ChIP-seq)      Train(ChIP-exo)
Factor          Order 0   Order 1    Order 0   Order 1    Order 0   Order 1    Order 0   Order 1
CTCF            0.882     0.881      0.800     0.806*     0.909     0.907      0.877     0.879

                Test(ChIP-exo)                            Test(PBM)
                Train(ChIP-exo)      Train(PBM)           Train(ChIP-exo)      Train(PBM)
Factor          Order 0   Order 1    Order 0   Order 1    Order 0   Order 1    Order 0   Order 1
Phd1            0.634     0.621      0.632     0.661*     0.786     0.889*     0.962*    0.957
Rap1            0.781*    0.766      0.800     0.819*     0.758*    0.727      0.823     0.837

We train Dimont on ChIP-seq, PBM or ChIP-exo data and apply each of the resulting classifiers to each of the available data sets for the same
transcription factor. Comparing AUC-ROC for motif orders 0 and 1, the maximum is displayed in bold face, and significant differences are marked
with an asterisk.

In the previous section, we observed that increasing the
motif order increases the prediction performance of
Dimont for PBM data. The existence of PBM, ChIP-seq
and/or ChIP-exo data for the same transcription factors
allows for investigating whether this observation is due to
artifacts of PBM data or due to true dependencies between
adjacent positions of transcription factor binding sites.



In Table 3, we find that classifiers trained on PBM data
and applied to ChIP data often achieve a greater
classification performance for motif order 1 than for motif order
0, whereas the opposite tendency can be observed for the
classifiers trained on ChIP data and applied to PBM data.
One explanation might be that the systematic design of
PBMs combined with the large number of probe
sequences allows for capturing true dependencies between
adjacent positions, whereas the dependencies learned
from ChIP-seq data are also influenced by general
dependencies in the long input sequences. An alternative
explanation could be that different modes of binding exist
for several transcription factors, where only one of these
modes is relevant for in vivo binding, but both are
represented in PBM data. Such heterogeneities could be
represented by higher order motif models, but not by PWMs.
In Supplementary Figures S5-S13, we study the dependencies
discovered by Dimont for all data sets that show a
significantly greater AUC-ROC for motif order 1 than for
motif order 0 for at least one combination of training and
test data sets, and we compare the dependencies detected by
Dimont to those detected by diChIPMunk (48) in Section 6
of the Supplementary Material.



DISCUSSION 

New high-throughput techniques including ChIP-seq,
ChIP-exo and PBMs have greatly increased the quality
and amount of data that are available for de novo motif
discovery. Specialized tools have been developed for
discovering motifs in ChIP-seq data, and other tools have
been developed for discovering motifs in PBM data.
However, none of the current tools work perfectly
across all of these techniques, which hampers integration
of data from different techniques and cross-technology
comparison of the resulting motifs.

Hence, we developed Dimont, a tool for de novo motif
discovery from ChIP-seq, ChIP-exo and PBM data using
an accelerated discriminative learning scheme. We test
Dimont on a collection of 26 ChIP-seq data sets and
observe that Dimont discovers all of the expected motifs,
where three of these motifs could not be discovered by any
previous approach. Hence, we may state that Dimont is
currently one of the best-performing approaches for de
novo motif discovery from ChIP-seq data. Applying
Dimont to ChIP-exo data sets of three yeast factors and
human CTCF, we find that the discovered motifs are in good
accordance with the literature. We also assess the performance
of Dimont on the PBM data of DREAM5 challenge 2 and find that
Dimont predicts signal intensities from PBM probe
sequence with greater accuracy than previous approaches.
Hence, we may state that Dimont is currently one of the
best-performing approaches for predicting PBM intensity
values from probe sequence. Against the background of
these three benchmark studies, we may state that Dimont
is a general approach for fast and accurate de novo motif
discovery from ChIP-seq, ChIP-exo and PBM data.
Although the runtime required by Dimont is greater than
the runtime of the currently fastest approach, POSMO (4),
we consider a maximum runtime of 1 h 15 min and a typical
runtime of <10 min acceptable after days or weeks of
wet-laboratory work.

We further investigate whether motifs discovered by 
Dimont from in vitro and in vivo data can be transferred 
from one technique to the other by comparing the dis- 
covered motifs and by cross-technology classification. 
For most transcription factors, we find a good generaliza- 
tion of the motifs discovered by Dimont, which indicates 
that in vitro experiments often yield motifs that are also 
valid for in vivo binding. However, we also observe sub- 
stantial differences between in vitro and in vivo binding for 
two transcription factors, namely, Tbx5 and Tbx20. 

For PBM data, we also observe that using an inhomogeneous
Markov model of order 1 instead of the popular
PWM model substantially increases prediction performance.
We investigate whether this finding can also be
transferred to ChIP-seq or ChIP-exo data. Indeed, we
observe that increasing the motif order to 1 for de novo
motif discovery from PBM data increases classification
accuracy on PBM as well as ChIP-seq and ChIP-exo
data in the majority of cases.
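
To make the difference between the two model classes concrete, the following Python sketch contrasts a PWM (motif order 0) with an inhomogeneous Markov model of order 1 (WAM). It is a simplified illustration only: parameters are estimated generatively from counts with pseudocounts, whereas Dimont learns its parameters discriminatively, and all function names and the toy binding sites are hypothetical.

```python
import math

ALPHABET = "ACGT"


def train_pwm(sites, pseudo=1.0):
    """Order 0: one independent nucleotide distribution per motif position."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {a: pseudo for a in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({a: counts[a] / total for a in ALPHABET})
    return pwm


def train_wam(sites, pseudo=1.0):
    """Order 1: each position is conditioned on the preceding nucleotide."""
    first = train_pwm([site[0] for site in sites], pseudo)[0]
    conditional = []
    for pos in range(1, len(sites[0])):
        table = {}
        for prev in ALPHABET:
            counts = {a: pseudo for a in ALPHABET}
            for site in sites:
                if site[pos - 1] == prev:
                    counts[site[pos]] += 1
            total = sum(counts.values())
            table[prev] = {a: counts[a] / total for a in ALPHABET}
        conditional.append(table)
    return first, conditional


def log_score_pwm(pwm, x):
    """Log-probability of x under the PWM."""
    return sum(math.log(pwm[pos][x[pos]]) for pos in range(len(x)))


def log_score_wam(wam, x):
    """Log-probability of x under the WAM."""
    first, conditional = wam
    score = math.log(first[x[0]])
    for pos in range(1, len(x)):
        score += math.log(conditional[pos - 1][x[pos - 1]][x[pos]])
    return score


# Toy sites with a dependency between the first two positions (AC or TG).
sites = ["ACGT", "ACGT", "TGGT", "TGGT"]
pwm, wam = train_pwm(sites), train_wam(sites)
```

On these toy sites, the PWM assigns the unobserved combination AGGT the same score as the observed site ACGT, because both agree with the position-wise frequencies, whereas the WAM scores ACGT higher because it captures that C, not G, follows A.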

These findings indicate that with the increased amount 
of data due to current high-throughput techniques, motif 
models capturing dependencies between motif positions 
may be of great value for predicting transcription factor 
binding sites, especially for predicting in vivo binding sites 
given in vitro training data. 

As Dimont is implemented in the open-source Java 
library Jstacs (http://www.jstacs.de), new models 
capturing such dependencies can flexibly be implemented 
and easily integrated into Dimont by advanced users. 

AVAILABILITY 

For instant use, we also provide a Dimont web server at
http://galaxy.informatik.uni-halle.de and a stand-alone
command line application at
http://www.jstacs.de/index.php/Dimont.